9/29/2018

Some relevant bits of my story

  • Software Engineer and lazy in all the right ways
  • Worked with some Expert Systems folks for a couple years at a place known for decision-making
  • Inability to deal with uncertainty bothered me to no end
  • Took time off to do Recurse Center and dive into this problem
  • Realized the majority of people trying to solve problems like this were doing something something Bayesian something graphical models

How do we make decisions?

  • Try to understand the state of the world
  • Guess at the processes through which the world reacts to stimuli
  • Associate utility scores with possible world states, weighted by their relative chances
  • Often, communication of findings is key

Example problem - mail delivery

  • I get a few items of mail most days, but some days I get 0
  • How often should I be experiencing no-mail days?
  • I suspect the USPS may have an efficiency policy in place that suggests skipping some postal customers each day
  • I could use statistical analysis to make a case to a reporter that it is worth investigating
  • Imagine I care a lot about the mail and that the reporter is very busy

Why is this hard?

  • Hard to represent the state of our knowledge
    • We have uncertainty around even the most deterministic of processes
    • And limited resources to discover the true process
  • Many possible models for a process
    • We will never know the True Model
    • Nor the True State of the World

Mail delivery uncertainty

  • Mail is relatively easy to measure, but the process that generates mail is incredibly complex
  • Poisson distribution makes a lot of sense, but
    • We’d like to see how well it can describe our data
    • And keep track of our uncertainty when making predictions or decisions
    • We suspect the Poisson is being fucked with
  • Visually explain our findings to our reporter friend in terms they find compelling

What does frequentism preach?

  • Set up your experiment correctly and then use some off-the-shelf test or estimator to get The Best Answer (point estimate)
    • Oh, and make sure this list of assumptions all hold
  • No parameter uncertainty, because probability only represents long-run frequency
  • “No” priors; incorporate minimal expert knowledge
  • To unfairly paraphrase, one should either be doing simpler experiments or much more difficult math
  • In this case, construct two models and compare some information criteria, probably

What does machine learning preach?

  • We have computers now, fuck calculus!
    • fair enough
  • Great success with algorithms that have only later had statistical interpretations ascribed to them
  • Progress = predictions 0.01% better than the last paper according to cross validation
    • Let’s ignore that the noise around cross-validated accuracy scores is much higher than 0.01%
  • Why would we need to interpret the model if its predictions are accurate?
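
The 0.01% claim can be sanity-checked with binomial arithmetic. A minimal sketch, assuming a hypothetical classifier with 90% true accuracy scored on a 10,000-example validation fold:

```python
import numpy as np

# Accuracy measured on n held-out examples is a binomial proportion,
# so its standard error is sqrt(p * (1 - p) / n).
# Hypothetical numbers: true accuracy 90%, a 10,000-example fold.
p, n = 0.9, 10_000
se = np.sqrt(p * (1 - p) / n)
print(f"std. error of the accuracy estimate: {se:.4f}")  # ~0.003, i.e. 0.3%
```

Even with a generously large fold, the noise floor is ~0.3%, thirty times larger than the 0.01% "improvement".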

What does Bayesianism preach?

  • Philosophically, inference without a prior is impossible
    • In frequentism it’s buried in an assumption
    • In ML it is often unclear whether there is any deeper statistical meaning under the code, but some prior is implied nonetheless
  • Retain a distribution as long as possible before making a decision
  • We also have computers!
    • Somehow, only more recently
    • We can actually express an accurate, bespoke model for our data generating process without worrying about an analytic integral
  • Construct many models and Always Be Graphing

What are we modeling?

  • “If we knew our parameter values, how would data be generated?” \(\pi(y \, | \, \theta)\)
  • But we want to learn \(\theta\)
  • \[\pi(\theta \, | \, y) = \frac{\pi(y \, | \, \theta) \pi(\theta)}{\pi(y)}\]
  • In our case, our parameters \(\theta\) might be the average amount of mail we get on any given day, or the percentage of days the carrier skips
  • But really, we want to compare a model with skipped days against one without
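
The update above can be sketched numerically with a grid approximation. Everything here is hypothetical: made-up mail counts, and a half-normal(0, 5) prior chosen as a generic weakly informative default:

```python
import numpy as np
from scipy import stats

# Hypothetical observed mail counts y; theta is the daily mail rate.
y = np.array([5, 3, 6, 0, 4, 7, 5])

# Evaluate everything on a uniform grid of candidate theta values.
theta = np.linspace(0.01, 15, 1000)

# pi(theta): a half-normal(0, 5) prior.
prior = stats.halfnorm.pdf(theta, scale=5)

# pi(y | theta): independent Poisson likelihood of the counts at each theta.
likelihood = np.prod(stats.poisson.pmf(y[:, None], theta), axis=0)

# pi(theta | y) is proportional to pi(y | theta) * pi(theta);
# normalize over the grid so it sums to 1.
posterior = prior * likelihood
posterior /= posterior.sum()

print("posterior mean of theta:", (theta * posterior).sum())
```

Grids stop working beyond a handful of parameters, which is exactly why samplers like Stan's exist.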

Example problem Stan model

data {
  int<lower=0> num_days;
  int<lower=0> mail[num_days];
}
parameters {
  real<lower=0> avg_mail_per_day;
}
model {
  avg_mail_per_day ~ normal(0, 5);
  mail ~ poisson(avg_mail_per_day);
}

Fitting a Stan model with PyStan

import pystan, numpy as np
model = pystan.StanModel("usps1.stan")
# r holds the raw observations loaded earlier
data = dict(mail=np.array(r["mail"], dtype="int"), num_days=int(r["num_days"]))
fit = model.sampling(data)
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                   mean se_mean     sd   2.5%     50%  97.5%  n_eff   Rhat
avg_mail_per_day   4.96  2.3e-3   0.08    4.8    4.96   5.11   1316    1.0

For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

How do we interpret the fit summary?

  • Low sd
  • Good diagnostics (Rhat < 1.1, no divergences or other warnings)
    • For more on diagnostics and proper workflow, go to @betanalpha’s master class tomorrow!
  • Assuming this model is true, we seem pretty confident that I get about 5 pieces of mail a day.
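
The Rhat column in the table can be computed by hand. A sketch of split-Rhat (potential scale reduction) on simulated chains, using the classic Gelman-Rubin formula:

```python
import numpy as np

def split_rhat(chains):
    # chains: (n_chains, n_draws); split each chain in half so that
    # within-chain drift also shows up as disagreement.
    half = chains.shape[1] // 2
    halves = np.vstack([chains[:, :half], chains[:, half:2 * half]])
    n = halves.shape[1]
    within = halves.var(axis=1, ddof=1).mean()        # W: mean within-chain variance
    between = n * halves.mean(axis=1).var(ddof=1)     # B: scaled variance of chain means
    var_plus = (n - 1) / n * within + between / n     # pooled variance estimate
    return np.sqrt(var_plus / within)

rng = np.random.default_rng(0)
# Four well-mixed "chains" exploring the same distribution: Rhat ~ 1.
good = rng.normal(5.0, 0.1, size=(4, 1000))
print(split_rhat(good))
# Chains stuck in different places: Rhat far above 1.
bad = good + np.arange(4)[:, None]
print(split_rhat(bad))
```
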

Visually communicating model fit

from matplotlib import pyplot as plt
mail_rate = fit.extract()["avg_mail_per_day"]  # 4000 posterior samples for this param
fake_data = np.random.poisson(mail_rate)  # 4000 fake data simulations
plt.hist([fake_data, data["mail"]], bins=15, density=True, color=["red", "blue"])

Add skip_percentage to model skipped days

parameters {
  real<lower=0> avg_mail_per_day;
  real<lower=0, upper=1> skip_percentage;
}
model {
  avg_mail_per_day ~ normal(0, 5);
  for (day in 1:num_days) {
    real without_skipping_density = poisson_lpmf(mail[day] | avg_mail_per_day);
    if (mail[day] == 0)
      target += log_mix(skip_percentage, 0, without_skipping_density);
    else
      target += without_skipping_density;
  }
}

Fit model with skip_percentage

Inference for Stan model: anon_model_7abb4b18761050015bf05aee642e2a30.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                   mean se_mean     sd    25%    50%  97.5%  n_eff   Rhat
avg_mail_per_day   5.54  2.0e-3   0.09   5.48   5.54   5.73   2231    1.0
skip_percentage    0.99  2.4e-4   0.01   0.98   0.99    1.0   2815    1.0
lp__              -1581    0.03   1.06  -1581  -1580  -1580   1535    1.0

Samples were drawn using NUTS at Sat Oct 13 16:17:55 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Simulate data implied by parameter draws

mail_rate15 = fit15.extract()["avg_mail_per_day"]
skip_percentage = fit15.extract()["skip_percentage"]
fake_data15 = np.zeros(len(mail_rate15))
for i in range(len(mail_rate15)):
    todays_mail = np.random.poisson(mail_rate15[i])
    if np.random.binomial(1, skip_percentage[i]):
        fake_data15[i] = 0  # the carrier skipped us today
    else:
        fake_data15[i] = todays_mail
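
The loop above can also be written in vectorized form, which reads closer to the generative story. This sketch is self-contained, with stand-in arrays in place of the actual fit15 posterior draws:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-ins for the 4000 posterior draws extracted from fit15.
mail_rate15 = rng.normal(5.5, 0.09, size=4000)
skip_percentage = rng.beta(50, 450, size=4000)  # centered near 0.1

# One simulated day per posterior draw: Poisson mail, zeroed out on skipped days.
todays_mail = rng.poisson(mail_rate15)
skipped = rng.binomial(1, skip_percentage).astype(bool)
fake_data15 = np.where(skipped, 0, todays_mail)
```
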

Visually compare simulations with observed data


Fix our math; intro log scales

parameters {
  real<lower=0> avg_mail_per_day;
  real<lower=0, upper=1> skip_percentage;
}
model {
  avg_mail_per_day ~ normal(0, 5);
  for (day in 1:num_days) {
    real without_skipping_density = poisson_lpmf(mail[day] | avg_mail_per_day);
    if (mail[day] == 0)
      target += log_mix(skip_percentage, 0, without_skipping_density);
    else
      target += bernoulli_lpmf(0 | skip_percentage) + without_skipping_density;
  }
}
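
The corrected target increments can be mirrored in Python as a sanity check. A sketch of the implied zero-inflated Poisson log likelihood, evaluated on hypothetical data:

```python
import numpy as np
from scipy import stats

def zip_loglik(mail, rate, skip_p):
    """Log likelihood of the skipped-days model at one parameter setting.

    Zero counts mix P(skip) with P(no skip) * Poisson(0 | rate);
    nonzero counts require no skip, matching the corrected Stan model."""
    mail = np.asarray(mail)
    pois = stats.poisson.logpmf(mail, rate)
    zero = np.logaddexp(np.log(skip_p), np.log1p(-skip_p) + pois)
    nonzero = np.log1p(-skip_p) + pois
    return np.where(mail == 0, zero, nonzero).sum()

# A parameter setting near the fitted posterior means, on made-up counts:
print(zip_loglik([5, 0, 6, 4, 0, 7], rate=5.5, skip_p=0.1))
```
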

Fixed fit

Inference for Stan model: anon_model_387876dd9af8a07353d37466d2bde6fb.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                   mean se_mean     sd    25%    50%  97.5%  n_eff   Rhat
avg_mail_per_day   5.52  1.6e-3   0.09   5.46   5.52    5.7   3350    1.0
skip_percentage     0.1  2.0e-4   0.01    0.1    0.1   0.13   3072    1.0
lp__              -1821    0.02   0.97  -1821  -1821  -1820   1884    1.0

Samples were drawn using NUTS at Sat Oct 13 13:45:07 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

PPC plot for corrected model


Pure Poisson (left) vs. with skipped days (right)


What would frequentism say to do here?

  • Plug the model (sans prior) and data into an estimator like maximum likelihood (with an implied uniform prior)
  • Get out the mode and compute a p-value: are the results significant?
  • Compare models with a likelihood-ratio test: twice the difference in log likelihoods is approximately chi-squared distributed
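
A sketch of that recipe with scipy, on simulated data with roughly 10% skipped days (all numbers hypothetical):

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(2)
# Two years of fake mail: Poisson(5.5), ~10% of days zeroed out.
mail = rng.poisson(5.5, 730) * (rng.random(730) > 0.1)

def nll_poisson(params):
    (rate,) = params
    return -stats.poisson.logpmf(mail, rate).sum()

def nll_zip(params):
    # Zero-inflated Poisson: zeros mix P(skip) with P(no skip) * Poisson.
    rate, skip_p = params
    pois = stats.poisson.logpmf(mail, rate)
    zero = np.logaddexp(np.log(skip_p), np.log1p(-skip_p) + pois)
    ll = np.where(mail == 0, zero, np.log1p(-skip_p) + pois)
    return -ll.sum()

fit0 = optimize.minimize(nll_poisson, [4.0], bounds=[(1e-6, None)])
fit1 = optimize.minimize(nll_zip, [4.0, 0.5],
                         bounds=[(1e-6, None), (1e-6, 1 - 1e-6)])

# Likelihood-ratio statistic: twice the gain in log likelihood,
# compared against chi-squared with 1 degree of freedom.
lr = 2 * (fit0.fun - fit1.fun)
print("LR statistic:", lr, "p-value:", stats.chi2.sf(lr, df=1))
```

One textbook caveat: the chi-squared approximation is shaky here, since the null (no skipping) sits on the boundary of the parameter space.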

Real life frequentist output

Call:  glm(formula = mail ~ 1, family = "poisson")

Coefficients:
(Intercept)  
      1.601  

Degrees of Freedom: 729 Total (i.e. Null);  729 Residual
Null Deviance:      1702 
Residual Deviance: 1702     AIC: 3961

What is wrong with optimization / MLE?

Area in higher dimensions is weird


(from Betancourt’s Conceptual Introduction to HMC)


Curse of dimensionality

  • “Somewhat counterintuitively, the average member of a population is an outlier. How could an item with every feature being average be unusual? Precisely because it is unusual to have so many features that close to average.” -Carpenter
  • Recall the Air Force’s attempt to design a cockpit for the average pilot
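
Carpenter's point shows up numerically: for a d-dimensional standard normal, a typical draw sits about sqrt(d) away from the mode at the origin, so almost no mass is near the densest point.

```python
import numpy as np

rng = np.random.default_rng(3)
for d in (1, 10, 100, 1000):
    # Draw points from a d-dimensional standard normal.
    x = rng.normal(size=(10_000, d))
    r = np.linalg.norm(x, axis=1)
    # Typical distance from the mode (the origin) grows like sqrt(d).
    print(f"d={d:5d}  mean distance={r.mean():6.2f}  sqrt(d)={np.sqrt(d):6.2f}")
```

This is why the mode found by an optimizer can sit far from where the posterior's probability mass actually lives.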

Curse of dimensionality


(from Mackay’s book)

How Stan’s HMC is different from optimizing